Overview

Dataset Statistics

Number of Variables 2
Number of Rows 3321
Missing Cells 5
Missing Cells (%) 0.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 604.7 MB
Average Row Size in Memory 186.4 KB
Variable Types
  • Numerical: 1
  • Categorical: 1
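
The derived figures above can be cross-checked from the raw counts. The following is a minimal sketch, assuming that the per-variable Memory Size values listed under Variables (53136 for ID, 634046276 for TEXT) are byte counts and that MB and KB are 1024-based; both assumptions are inferred from the numbers rather than stated in the report. Under them, the results agree with the reported percentages and sizes to within about 0.1.

```python
# Cross-check of the dataset-level figures from the raw counts.
rows, variables = 3321, 2
missing_cells = 5
id_bytes, text_bytes = 53136, 634046276  # assumption: values are bytes

# Share of missing cells over all cells (rows x variables).
missing_pct = 100 * missing_cells / (rows * variables)

# Total and per-row memory, using 1024-based units (assumption).
total_bytes = id_bytes + text_bytes
total_mb = total_bytes / 1024**2
avg_row_kb = total_bytes / rows / 1024

print(f"missing cells: {missing_pct:.2f}%")
print(f"total size:    {total_mb:.2f} MB")
print(f"avg row size:  {avg_row_kb:.2f} KB")
```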

Dataset Insights

ID is uniformly distributed
TEXT has a high cardinality: 1920 distinct values

Variables


ID

numerical

Approximate Distinct Count 3321
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 53136 bytes
Mean 1660
Minimum 0
Maximum 3320
Zeros 1
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • ID is uniformly distributed

Quantile Statistics

Minimum 0
5-th Percentile 157.7
Q1 821.7
Median 1651.71
Q3 2481.71
95-th Percentile 3145.91
Maximum 3320
Range 3320
IQR 1660.01

Descriptive Statistics

Mean 1660
Standard Deviation 958.8344
Variance 919363.5
Sum 5.5129e+06
Skewness 0
Kurtosis -1.2
Coefficient of Variation 0.5776
  • ID is not normally distributed (p-value 0.0)
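
The ID statistics are consistent with a plain row index. The sketch below is a hedged check, assuming ID takes each integer from 0 to 3320 exactly once (suggested by the minimum, maximum, and distinct count, but not confirmed by the report). It recomputes the descriptive statistics using the sample variance (divisor n - 1), which is the convention that matches the reported variance, and moment-based skewness and excess kurtosis.

```python
import math

n = 3321
ids = range(n)  # assumption: ID is 0, 1, ..., 3320 with no repeats

total = sum(ids)
mean = total / n
dev = [x - mean for x in ids]

ss = sum(d**2 for d in dev)      # sum of squared deviations
variance = ss / (n - 1)          # sample variance (divisor n - 1)
std = math.sqrt(variance)
cv = std / mean                  # coefficient of variation

m2 = ss / n                      # second central moment
skewness = (sum(d**3 for d in dev) / n) / m2**1.5
excess_kurtosis = (sum(d**4 for d in dev) / n) / m2**2 - 3

print(total, mean, variance)
print(round(std, 4), round(cv, 4))
print(skewness, round(excess_kurtosis, 2))
```

A skewness of 0 and an excess kurtosis near -1.2 are what a uniform distribution produces, which fits the "uniformly distributed" insight.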

TEXT

categorical

Approximate Distinct Count 1920
Approximate Unique (%) 57.9%
Missing 5
Missing (%) 0.2%
Memory Size 634046276 bytes

Length

Mean 63711.8881
Standard Deviation 52170.1235
Median 50070
Minimum 337
Maximum 523393

Sample

1st row Cyclin-dependent k...
2nd row Abstract Backgrou...
3rd row Abstract Backgrou...
4th row Recent evidence ha...
5th row Oncogenic mutation...

Letter

Count 163602703
Lowercase Letter 151140210
Space Separator 31793227
Uppercase Letter 12462493
Dash Punctuation 1099569
Decimal Number 7515402
  • TEXT contains many words: 257797 words
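
The character classes listed above correspond to Unicode general categories (Ll, Lu, Zs, Pd, Nd). The following is a minimal sketch of how such counts could be tallied with Python's standard library; the function name `char_category_counts` and the sample string are hypothetical, and this is not necessarily how the profiling tool computes them. In the report, the Letter count equals the sum of the lowercase and uppercase counts (151140210 + 12462493 = 163602703).

```python
import unicodedata
from collections import Counter

# Report labels mapped to Unicode general category codes.
LABELS = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Zs": "Space Separator",
    "Pd": "Dash Punctuation",
    "Nd": "Decimal Number",
}

def char_category_counts(text: str) -> Counter:
    """Count characters per labelled Unicode general category."""
    counts = Counter()
    for ch in text:
        label = LABELS.get(unicodedata.category(ch))
        if label is not None:
            counts[label] += 1
    return counts

# Hypothetical sample string, for illustration only.
sample_counts = char_category_counts("Abc-d 12")
letters = sample_counts["Lowercase Letter"] + sample_counts["Uppercase Letter"]
print(dict(sample_counts), letters)
```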

Interactions

Correlations

Missing Values